#install packages
#only run once
#install.packages("tidyverse")
#install.packages("dplyr")
#install.packages("gt")
#install.packages("ggplot2")
#install.packages("reshape2")
#install.packages("scales")
#install.packages("gmodels")
#install.packages("DT")
#install.packages("corrplot")
#install.packages("Hmisc")
#install.packages("car")
#install.packages("BAMMtools")
#install.packages("broom")
#install.packages("stargazer")
#run every session
library(tidyverse)
library(dplyr)
library(gt)
library(ggplot2)
library(reshape2)
library(scales)
library(gmodels)
library(DT)
library(corrplot)
library(Hmisc)
library(car)
library(gridExtra)
library(BAMMtools)
library(broom)
library(stargazer)
#set colors
palette <- c("#a02300", "#0059c7", "#6496b4", "#E1B93C")
#red and gold are accent
#blue is chart
#dark blue is outline
cwns2012 <- read_csv("https://raw.githubusercontent.com/c-townsley/pbn_a2b_repo/main/Data/cwns2012.csv?token=GHSAT0AAAAAACAEJN5DRTWOHJRNE36CS67GZAZCUFQ")To: Mayor Kenney
From: Oliver Atwood and Charlie Townsley, O+C
Analytics
Subject: The Water Crisis in Philadelphia
Hey Mayor Kenney, we know you’re a busy man, but we’ve got something important to say! We’ve got a real problem on our hands, and it’s not just about the potholes in the streets or the trash in the alleys. It’s about the water.
Using the magic of computers, our team conducted a statistical analysis of national wastewater treatment facility data, and we found that Philadelphia is in desperate need of some serious investment in our water treatment infrastructure. We’re talking about $1.29 billion, Mayor, over $400 million more than the national average! That’s a lot of cash.
Now, I know what you’re thinking. “Why should I care about this?” Well, let me tell you, Mayor. Without clean water, we’ve got nothing. No coffee, no beer, no nothing! And that’s just the beginning. A lack of investment in water infrastructure can lead to serious health risks, economic losses, and environmental degradation. We’re talking sewage in the rivers, Mayor!
We’ve found that the level of investment needed is correlated with many other values nationally, including population density, low median income, and high projected changes in population density and number of residential households receiving treatment. Another correlation we unearthed was one of investment need with use of a combined sewer system. Remember that sewage in the rivers I mentioned earlier? Well, if you’re using a combined sewer and it rains, even a little bit, bam! Sewage in the rivers.
Now our analysis has truly been world class, but we need to dig deeper, Mayor. We need to figure out exactly what’s going on in our city and come up with a plan to fix it. That’s why we’re requesting an additional round of research funding to look into this matter further. Specifically, we would like to further explore the relationships between need combined sewer systems, and systems that contribute to highly polluted water bodies. We need to understand the unique challenges we’re facing and come up with a solution before it’s too late.
So, Mayor Kenney, we’re counting on you to take this seriously. We need to invest in our water infrastructure if we want to have a healthy, thriving city. Let’s get to work.
Sincerely,
Oliver Atwood and Charlie Townsley O+C Analytics
Update your conclusion to the report to Mayor Kenney summarizing your findings. What additional variables would be helpful to include? What are the major implications of your findings?
Highlight conclusion updates
We generated the below metrics to measure potential drivers of reported need.
cwns2012 <- cwns2012 %>%
mutate(proj_change_res_rec_collctn = PROJ_RES_REC_COLLCTN - PRES_RES_REC_COLLCTN) %>%
mutate(proj_prcntchng_res_rec_collctn = (proj_change_res_rec_collctn / PRES_RES_REC_COLLCTN)*100) %>%
mutate(proj_change_res_rec_trmt = PROJ_RES_REC_TRMT - PRES_RES_REC_TRMT) %>%
mutate(proj_prcntchng_res_rec_trmt = (proj_change_res_rec_trmt / PRES_RES_REC_TRMT)*100) %>%
mutate(proj_chng_n_res_rec_collctn = PROJ_N_RES_REC_COLLCTN - PRES_N_RES_REC_COLLCTN) %>%
mutate(proj_prcntchng_n_res_rec_collctn = (proj_chng_n_res_rec_collctn / PRES_N_RES_REC_COLLCTN)*100) %>%
mutate(proj_chng_n_res_rec_trmt = PROJ_N_RES_REC_TRMT - PRES_N_RES_REC_TRTM) %>%
mutate(proj_prcntchng_n_res_rec_trmt = (proj_chng_n_res_rec_trmt / PRES_N_RES_REC_TRTM)*100) %>%
mutate(res_pop10_dens = (POP10/(ALAND + AWATER))*1e+6) %>%
mutate(res_pop00_dens = (POP00/(ALAND + AWATER))*1e+6) %>%
mutate(res_pop90_dens = (POP90/(ALAND + AWATER))*1e+6) %>%
mutate(res_pop80_dens = (POP80/(ALAND + AWATER))*1e+6) %>%
mutate(res_pop_dens_chng_80_10 = (res_pop10_dens - res_pop80_dens)) %>%
mutate(res_pop_dens_prcnt_chng_80_10 = (res_pop_dens_chng_80_10 / res_pop80_dens)*100) %>%
mutate(medinc_prcnt_chng_69_09 = ((MEDINC09 - MEDINC69)/MEDINC69)*100)
#r clear Nans and Inf values
cwns2012[sapply(cwns2012, is.nan)] <- 0
cwns2012[sapply(cwns2012, is.infinite)] <- 0
#Interactive Summary Table of Resilience Measures by EPA Region
cwns2012 %>%
dplyr::select(TOTAL_OFFICIAL_NEED, EPA_REGION, proj_change_res_rec_collctn, proj_prcntchng_res_rec_collctn,
proj_change_res_rec_trmt, proj_prcntchng_res_rec_trmt, proj_chng_n_res_rec_collctn,
proj_prcntchng_n_res_rec_collctn, proj_chng_n_res_rec_trmt, proj_prcntchng_n_res_rec_trmt,
res_pop10_dens, res_pop10_dens, res_pop00_dens, res_pop90_dens, res_pop80_dens,
res_pop_dens_chng_80_10, medinc_prcnt_chng_69_09) %>%
DT::datatable(., options = list(scrollX = TRUE))Projected Change in Residential Population Receiving Collection
summary(cwns2012$proj_change_res_rec_collctn)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -88018.0 19.0 275.5 4636.7 1800.0 1300000.0
Projected Percent Change in Residential Population Receiving Collection
summary(cwns2012$proj_prcntchng_res_rec_collctn)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100.000 1.623 12.386 46.842 35.634 29900.000
Projected Change in Residential Population Receiving Treatment
summary(cwns2012$proj_change_res_rec_trmt)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -187386 31 326 5477 2096 1300028
Projected Percent Change in Residential Population Receiving Treatment
summary(cwns2012$proj_prcntchng_res_rec_trmt)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100 2 14 2661 38 21999900
Projected Change in Non-Residential Population Receiving Collection
summary(cwns2012$proj_chng_n_res_rec_collctn)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -254000.0 0.0 0.0 376.8 0.0 354330.0
Projected Percent Change in Non-Residential Population Receiving Collection
summary(cwns2012$proj_prcntchng_n_res_rec_collctn)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100.000 0.000 0.000 9.732 0.000 5256.000
Projected Change in Non-Residential Population Receiving Treatment
summary(cwns2012$proj_chng_n_res_rec_trmt)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -254000.0 0.0 0.0 437.9 0.0 354330.0
Projected Percent Change in Non-Residential Population Receiving Treatment
summary(cwns2012$proj_prcntchng_n_res_rec_trmt)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100.00 0.00 0.00 13.03 0.00 19921.43
Residential Population Density 2010 units are persons per kilometer2
summary(cwns2012$res_pop10_dens)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.084 13.803 37.982 165.804 135.014 18179.902
Change in Population Density 1980 - 2010
summary(cwns2012$res_pop_dens_chng_80_10)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -781.8179 -0.2706 4.0604 27.9594 30.7569 1806.5346
Percent Change in Population Density 1980 - 2010
summary(cwns2012$res_pop_dens_prcnt_chng_80_10)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -55.684 -1.915 16.345 35.744 47.968 1034.914
Percent Change in Median Income 1969 - 2009
summary(cwns2012$medinc_prcnt_chng_69_09)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 529.5 592.4 606.3 668.6 1379.0
Tables and figures that describe the trends we see in the data, beginning with the top and bottom 10 facilities with the greatest and least need for water infrastructure.
#Create a new dataframe with the top ten facilities by need
need_top10 <- cwns2012 %>%
arrange(desc(TOTAL_OFFICIAL_NEED)) %>%
filter(TOTAL_OFFICIAL_NEED != 0) %>%
slice(1:10)
#Create a table with the top ten facilities by need and geography
need_top10 %>%
dplyr::select(TOTAL_OFFICIAL_NEED, FACILITY_CITY, COUNTYNAME, STATE, EPA_REGION) %>%
gt() %>%
tab_header(title = "Top Ten Facilities by Need") %>%
fmt_currency(columns = c(TOTAL_OFFICIAL_NEED), currency = "USD")| Top Ten Facilities by Need | ||||
| TOTAL_OFFICIAL_NEED | FACILITY_CITY | COUNTYNAME | STATE | EPA_REGION |
|---|---|---|---|---|
| $5,160,108,526.00 | JAMAICA | Queens | NY | 2 |
| $3,945,228,972.00 | NEW YORK | Kings | NY | 2 |
| $3,289,580,276.00 | Miami | Miami-Dade | FL | 4 |
| $2,480,700,000.00 | HARTFORD | Hartford | CT | 1 |
| $2,428,761,777.00 | ST. LOUIS | St. Louis | MO | 7 |
| $2,269,619,593.00 | NA | Los Angeles | CA | 9 |
| $2,163,499,139.00 | CINCINNATI | Hamilton | OH | 5 |
| $2,140,388,600.00 | NEW YORK | New York | NY | 2 |
| $2,096,708,793.00 | INDIANAPOLIS | Marion | IN | 5 |
| $2,036,990,483.00 | Miami | Miami-Dade | FL | 4 |
#Create a new dataframe with the bottom ten facilities by need
need_bot10 <- cwns2012 %>%
arrange(TOTAL_OFFICIAL_NEED) %>%
filter(TOTAL_OFFICIAL_NEED != 0) %>%
slice(1:10)
#Create a table with the bottom ten facilities by need and geography
need_bot10 %>%
dplyr::select(TOTAL_OFFICIAL_NEED, FACILITY_CITY, COUNTYNAME, STATE, EPA_REGION) %>%
gt() %>%
tab_header(title = "Bottom Ten Facilities by Need") %>%
fmt_currency(columns = c(TOTAL_OFFICIAL_NEED), currency = "USD")| Bottom Ten Facilities by Need | ||||
| TOTAL_OFFICIAL_NEED | FACILITY_CITY | COUNTYNAME | STATE | EPA_REGION |
|---|---|---|---|---|
| $3,442.00 | Maple Grove | Hennepin | MN | 5 |
| $4,219.00 | NA | Delaware | PA | 3 |
| $6,748.00 | TANGENT | Linn | OR | 10 |
| $7,608.00 | NEW HARTFORD | Litchfield | CT | 1 |
| $7,826.00 | NA | Trempealeau | WI | 5 |
| $10,332.00 | NA | Washington | AR | 6 |
| $11,375.00 | NA | Cook | IL | 5 |
| $12,865.00 | WILD ROSE | Waushara | WI | 5 |
| $14,368.00 | NA | Walworth | WI | 5 |
| $15,204.00 | ROCKLAND | Plymouth | MA | 1 |
Mean Total Official Need
total_need_mean <- mean(cwns2012$TOTAL_OFFICIAL_NEED)
total_need_mean## [1] 21623215
Mean Official Need in Philadelphia
philly_facilities <- cwns2012 %>%
filter(STATE == "PA", FACILITY_CITY == "PHILADELPHIA")
need_philly_mean <- mean(philly_facilities$TOTAL_OFFICIAL_NEED)
need_philly_mean## [1] 429336765
Total Official Need in Philadelphia
need_philly_total <- sum(philly_facilities$TOTAL_OFFICIAL_NEED)
need_philly_total## [1] 1288010295
Difference between Mean Official Need in Philadelphia and
National Mean Total Official Need
Mean budgeted need of Philadelphia facilities is $407,713,550 higher
than mean total official need.
need_philly_mean - total_need_mean## [1] 407713550
Calculate additional means of density and median income
density_philly_mean <- mean(philly_facilities$res_pop10_dens)
medincome_philly_mean <- mean(philly_facilities$MEDINC09)
print('Mean Philadelphia Density')## [1] "Mean Philadelphia Density"
density_philly_mean## [1] 4128.701
print('Mean Philadelphia Median Income')## [1] "Mean Philadelphia Median Income"
medincome_philly_mean## [1] 45842
Test distribution of log-adjusted Total Official
Need
Mean of all facilities in gold
Mean of Philly facilities in red
cwns2012 <- cwns2012 %>%
mutate(log_TOTAL_NEED = ifelse(TOTAL_OFFICIAL_NEED > 0, log(TOTAL_OFFICIAL_NEED), TOTAL_OFFICIAL_NEED))
ggplot(cwns2012, aes(x=log_TOTAL_NEED)) +
geom_histogram(color="#0059c7", size=.1, fill="#6496b4", binwidth=.5, alpha=0.8) +
geom_vline(aes(xintercept=mean(log_TOTAL_NEED)), color="#E1B93C") +
geom_vline(aes(xintercept=mean(log(philly_facilities$TOTAL_OFFICIAL_NEED))), color="#a02300") +
theme_bw()Test distribution of Median Income
Mean of all facilities in gold
Mean of Philly facilities in red
cwns2012%>%
ggplot(aes(x=MEANINC09)) +
ggtitle("Median Income Distribution") +
xlab("Distribution of Median Income (2009)") + ylab("Median Income") +
geom_density(fill="#6496b4", size=.35, color="#0059c7", alpha=0.8) +
geom_vline(aes(xintercept=mean(MEANINC09)), color="#E1B93C") +
geom_vline(aes(xintercept=mean(philly_facilities$MEANINC09)), color="#a02300") +
theme_bw()#Summarize total need and group by EPA region
mean_density_need_epa <- cwns2012 %>%
group_by(EPA_REGION) %>%
summarise(count = n(), mean_popdens = mean(res_pop10_dens), mean_need = mean(TOTAL_OFFICIAL_NEED))
ggplot(mean_density_need_epa, aes(x = mean_popdens, y = mean_need)) +
geom_smooth(method=lm, color="#0059c7") +
geom_point(size=3, color="black", fill="#6496b4", shape=21, alpha=.5) +
ggtitle("Population Density vs. Total Need, by EPA Region") +
geom_text(aes(label = rownames(mean_density_need_epa), hjust = 1.75, vjust = 1)) +
xlab("Mean Population Density (2010)") + ylab("Mean Official Need") +
theme_bw()Density and need seem to be correlated. So how does the mean need of Philadelphia facilities compare with the mean values for each EPA region?
Here is the same plot with Philadelphia shown as a red dot.
mean_density_need_epa <- cwns2012 %>%
group_by(EPA_REGION) %>%
summarise(count = n(), mean_popdens = mean(res_pop10_dens), mean_need = mean(TOTAL_OFFICIAL_NEED))
ggplot(mean_density_need_epa, aes(x = mean_popdens, y = mean_need)) +
geom_smooth(method=lm, color="#0059c7") +
geom_point(size=3, color="black", fill="#6496b4", shape=21, alpha=.5) +
ggtitle("Population Density vs. Total Need, by EPA Region") +
geom_text(aes(label = rownames(mean_density_need_epa), hjust = 1.75, vjust = 1)) +
xlab("Mean Population Density (2010)") + ylab("Mean Official Need") +
geom_point(aes(x = density_philly_mean , y = need_philly_mean),size=3, shape=21,color="black", fill="#a02300") +
theme_bw()How does Philly’s density and need compare to other cities in the same EPA region?
#Summarize total need in EPA region 3
mean_density_need_R3 <- cwns2012 %>%
filter(EPA_REGION == '3') %>%
group_by(FACILITY_CITY) %>%
summarise(count = n(), mean_popdens = mean(res_pop10_dens), mean_need = mean(TOTAL_OFFICIAL_NEED))
ggplot(mean_density_need_R3, aes(x = mean_popdens, y = mean_need)) +
geom_point(size=3, color="black", fill="#6496b4", shape=21, alpha=.5) +
ggtitle("Population Density vs. Total Need by City in EPA Region 3") +
xlab("Mean Population Density (2010)") + ylab("Mean Official Need") +
geom_point(aes(x = density_philly_mean , y = need_philly_mean),size=3, shape=21,color="black", fill="#a02300") +
theme_bw()What about income and need in the same EPA region?
#Summarize total need in EPA region 3
mean_income_need_R3 <- cwns2012 %>%
filter(EPA_REGION == '3') %>%
group_by(FACILITY_CITY) %>%
summarise(count = n(), mean_medinc = mean(MEDINC09), mean_need = mean(TOTAL_OFFICIAL_NEED))
ggplot(mean_income_need_R3, aes(x = mean_medinc, y = mean_need)) +
geom_point(size=3, color="black", fill="#6496b4", shape=21, alpha=.5) +
ggtitle("Median Income vs. Total Need by City in EPA Region 3") +
xlab("Median Income (2009)") + ylab("Mean Official Need") +
geom_point(aes(x = medincome_philly_mean , y = need_philly_mean),size=3, shape=21,color="black", fill="#a02300") +
theme_bw()Summary of overall budgeted need for facilities with CSS (Combined Sewer Systems):
cwns2012_css <- filter(cwns2012, CSS == 1)
summary(cwns2012_css$TOTAL_OFFICIAL_NEED)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 2.833e+06 8.814e+06 1.131e+08 3.625e+07 5.160e+09
Summary of overall budgeted need for facilities with MS4s (Municipal Separate Storm Sewer Systems):
cwns2012_ms4 <- filter(cwns2012, MS4 == 1)
summary(cwns2012_ms4$TOTAL_OFFICIAL_NEED)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 6.281e+05 2.162e+06 1.338e+07 6.708e+06 3.290e+09
We can see that mean budgeted need for facilities with CSS is higher than for facilities with MS4s, but is this difference statistically meaningful?
Conducting a Welch two-sample t-test reveals that these means are significantly different from each other.
It gives a t-statistic that is greater than the critical value for significance of 1.96 and allows us reject the null hypothesis that the mean budgeted need is the same for CSS and MS4s with greater than 95% confidence.
t.test(TOTAL_OFFICIAL_NEED ~ CSS, data = cwns2012, paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by CSS
## t = -6.712, df = 699.19, p-value = 3.972e-11
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -128823442 -70514032
## sample estimates:
## mean in group 0 mean in group 1
## 13384564 113053301
Budgeted need for facilities that contribute to a TMDL receiving water body:
cwns_yestmdl <- filter(cwns2012, TMDL_INDICATOR == "Y")
summary(cwns_yestmdl$TOTAL_OFFICIAL_NEED)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 1.249e+06 3.862e+06 5.900e+07 1.437e+07 3.945e+09
Budgeted need for facilities that do not contribute to a TMDL receiving water body:
cwns_notmdl <- filter(cwns2012, TMDL_INDICATOR == "N")
summary(cwns_notmdl$TOTAL_OFFICIAL_NEED)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 6.908e+05 2.372e+06 2.023e+07 7.537e+06 5.160e+09
Mean budgeted need for facilities that do contribute to a TMDL is higher than for facilities that do not.
Running a Welch two sample t-test shows that mean budgeted need is statistically different between facilities that do and do not contribute to a TMDL.
It gives a t-statistic that surpasses the critical value for significance of 1.96 and allows us reject the null hypothesis that the means are the same with greater than 95% confidence.
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = cwns2012, paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = -2.2283, df = 305.8, p-value = 0.02658
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -73003450 -4533819
## sample estimates:
## mean in group N mean in group Y
## 20228097 58996731
We found a statistically significant difference in mean budgeted need between facilities that do contribute to a TMDL and those that do not in just four out of ten EPA Regions. These were regions 2, 3, 5, and 6.
Because we could not reject the null hypothesis (that means are different) in 60% of EPA regions, we do not see TMDL by EPA region as a useful metric for further exploration.
Region 2
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 2), paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = -3.5245, df = 5.0045, p-value = 0.01681
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -3033874754 -475206220
## sample estimates:
## mean in group N mean in group Y
## 39662349 1794202835
Region 3
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 3), paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = 2.6523, df = 31.104, p-value = 0.01247
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## 2196333 16806577
## sample estimates:
## mean in group N mean in group Y
## 18403975 8902520
Region 5
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 5), paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = 3.9798, df = 1436.2, p-value = 7.241e-05
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## 5244922 15440674
## sample estimates:
## mean in group N mean in group Y
## 15088120 4745322
Region 6
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 6), paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = -3.074, df = 63.917, p-value = 0.003104
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -80860108 -17158802
## sample estimates:
## mean in group N mean in group Y
## 9455932 58465387
We found no statistically significant difference (failed to reject the null hypothesis) in mean budgeted need between facilities that do contribute to a TMDL and those that do not in regions 1, 4, 9, and 10.
Region 1
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 1), paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = -0.85922, df = 3.0193, p-value = 0.453
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -542766502 311336752
## sample estimates:
## mean in group N mean in group Y
## 32685334 148400209
Region 4
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 4), paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = 0.12885, df = 18.325, p-value = 0.8989
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -17470739 19756906
## sample estimates:
## mean in group N mean in group Y
## 22352317 21209234
Region 9
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 9), paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = -0.87332, df = 1.0125, p-value = 0.5415
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -1749036595 1517883950
## sample estimates:
## mean in group N mean in group Y
## 65458334 181034656
Region 10
t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 10), paired = FALSE)##
## Welch Two Sample t-test
##
## data: TOTAL_OFFICIAL_NEED by TMDL_INDICATOR
## t = 1.2645, df = 252, p-value = 0.2072
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -4907121 22511440
## sample estimates:
## mean in group N mean in group Y
## 26134688 17332528
Regions 7 and 8
Regions 7 and 8 both failed the t-test outright because there was not
enough data.
# t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 7), paired = FALSE)
# t.test(TOTAL_OFFICIAL_NEED ~ TMDL_INDICATOR, data = filter(cwns2012, EPA_REGION == 8), paired = FALSE)To test for association between total budgeted need and EPA region, we first need to divide total need into a limited number of categories that a cross table can process.
#First create a new column that breaks total official need into six classes
cwns_int <- cwns2012 %>%
mutate(need_int = ntile(cwns2012$TOTAL_OFFICIAL_NEED, 5)) %>%
dplyr::select(need_int, TOTAL_OFFICIAL_NEED)Chi-Square test of association between need and EPA region shows that we can confidently reject the null hypothesis that need and epa region are independent variables due to the incredibly small p-value.
CrossTable(cwns2012$EPA_REGION, cwns_int$need_int, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | cwns_int$need_int
## cwns2012$EPA_REGION | 1 | 2 | 3 | 4 | 5 | Row Total |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 42 | 51 | 63 | 83 | 126 | 365 |
## | 73.000 | 73.000 | 73.000 | 73.000 | 73.000 | |
## | 13.164 | 6.630 | 1.370 | 1.370 | 38.479 | |
## | 11.507% | 13.973% | 17.260% | 22.740% | 34.521% | 4.335% |
## | 2.494% | 3.029% | 3.741% | 4.929% | 7.482% | |
## | 0.499% | 0.606% | 0.748% | 0.986% | 1.496% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 105 | 109 | 130 | 150 | 120 | 614 |
## | 122.800 | 122.800 | 122.800 | 122.800 | 122.800 | |
## | 2.580 | 1.551 | 0.422 | 6.025 | 0.064 | |
## | 17.101% | 17.752% | 21.173% | 24.430% | 19.544% | 7.292% |
## | 6.235% | 6.473% | 7.720% | 8.907% | 7.126% | |
## | 1.247% | 1.295% | 1.544% | 1.781% | 1.425% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 125 | 165 | 239 | 258 | 244 | 1031 |
## | 206.200 | 206.200 | 206.200 | 206.200 | 206.200 | |
## | 31.976 | 8.232 | 5.217 | 13.013 | 6.929 | |
## | 12.124% | 16.004% | 23.181% | 25.024% | 23.666% | 12.245% |
## | 7.423% | 9.798% | 14.192% | 15.321% | 14.489% | |
## | 1.485% | 1.960% | 2.838% | 3.064% | 2.898% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 226 | 266 | 231 | 320 | 335 | 1378 |
## | 275.600 | 275.600 | 275.600 | 275.600 | 275.600 | |
## | 8.927 | 0.334 | 7.218 | 7.153 | 12.802 | |
## | 16.401% | 19.303% | 16.763% | 23.222% | 24.311% | 16.366% |
## | 13.420% | 15.796% | 13.717% | 19.002% | 19.893% | |
## | 2.684% | 3.159% | 2.743% | 3.800% | 3.979% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 311 | 415 | 424 | 347 | 255 | 1752 |
## | 350.400 | 350.400 | 350.400 | 350.400 | 350.400 | |
## | 4.430 | 11.910 | 15.459 | 0.033 | 25.974 | |
## | 17.751% | 23.687% | 24.201% | 19.806% | 14.555% | 20.808% |
## | 18.468% | 24.644% | 25.178% | 20.606% | 15.143% | |
## | 3.694% | 4.929% | 5.036% | 4.121% | 3.029% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 6 | 253 | 234 | 207 | 185 | 184 | 1063 |
## | 212.600 | 212.600 | 212.600 | 212.600 | 212.600 | |
## | 7.677 | 2.154 | 0.148 | 3.583 | 3.847 | |
## | 23.801% | 22.013% | 19.473% | 17.404% | 17.310% | 12.625% |
## | 15.024% | 13.895% | 12.292% | 10.986% | 10.926% | |
## | 3.005% | 2.779% | 2.458% | 2.197% | 2.185% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 7 | 427 | 251 | 205 | 126 | 92 | 1101 |
## | 220.200 | 220.200 | 220.200 | 220.200 | 220.200 | |
## | 194.215 | 4.308 | 1.049 | 40.298 | 74.638 | |
## | 38.783% | 22.797% | 18.619% | 11.444% | 8.356% | 13.076% |
## | 25.356% | 14.905% | 12.173% | 7.482% | 5.463% | |
## | 5.071% | 2.981% | 2.435% | 1.496% | 1.093% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 8 | 131 | 138 | 116 | 88 | 42 | 515 |
## | 103.000 | 103.000 | 103.000 | 103.000 | 103.000 | |
## | 7.612 | 11.893 | 1.641 | 2.184 | 36.126 | |
## | 25.437% | 26.796% | 22.524% | 17.087% | 8.155% | 6.116% |
## | 7.779% | 8.195% | 6.888% | 5.226% | 2.494% | |
## | 1.556% | 1.639% | 1.378% | 1.045% | 0.499% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 9 | 22 | 14 | 21 | 64 | 197 | 318 |
## | 63.600 | 63.600 | 63.600 | 63.600 | 63.600 | |
## | 27.210 | 38.682 | 28.534 | 0.003 | 279.804 | |
## | 6.918% | 4.403% | 6.604% | 20.126% | 61.950% | 3.777% |
## | 1.306% | 0.831% | 1.247% | 3.800% | 11.698% | |
## | 0.261% | 0.166% | 0.249% | 0.760% | 2.340% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 10 | 42 | 41 | 48 | 63 | 89 | 283 |
## | 56.600 | 56.600 | 56.600 | 56.600 | 56.600 | |
## | 3.766 | 4.300 | 1.307 | 0.724 | 18.547 | |
## | 14.841% | 14.488% | 16.961% | 22.261% | 31.449% | 3.361% |
## | 2.494% | 2.435% | 2.850% | 3.741% | 5.285% | |
## | 0.499% | 0.487% | 0.570% | 0.748% | 1.057% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## --------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1025.513 d.f. = 36 p = 6.99429e-192
##
##
##
## Minimum expected frequency: 56.6
Below are five variables from those we originally explored, divided into five equal intervals and tested for their association with total reported need.
We can confidently reject the null hypothesis that need and 2010 residential population density are independent variables due to the teeny tiny p-value (p = 5.427126e-207).
#First create a new column that breaks res_pop10_dens into distinct classes
#1 is low and 5 is high
cwns_int <- cwns_int %>%
mutate(res_pop10_dens = ntile(cwns2012$res_pop10_dens, 5))
#cross tab of res_pop10_dens and need
CrossTable(cwns_int$need_int, cwns_int$res_pop10_dens, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | cwns_int$res_pop10_dens
## cwns_int$need_int | 1 | 2 | 3 | 4 | 5 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 560 | 409 | 288 | 233 | 194 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 147.916 | 15.478 | 7.071 | 31.991 | 60.546 | |
## | 33.254% | 24.287% | 17.102% | 13.836% | 11.520% | 20.000% |
## | 33.254% | 24.287% | 17.102% | 13.836% | 11.520% | |
## | 6.651% | 4.857% | 3.420% | 2.767% | 2.304% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 410 | 425 | 362 | 274 | 213 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 15.909 | 23.098 | 1.886 | 11.710 | 45.506 | |
## | 24.347% | 25.238% | 21.496% | 16.271% | 12.648% | 20.000% |
## | 24.347% | 25.238% | 21.496% | 16.271% | 12.648% | |
## | 4.869% | 5.048% | 4.299% | 3.254% | 2.530% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 369 | 359 | 366 | 317 | 273 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 3.079 | 1.463 | 2.532 | 1.164 | 12.086 | |
## | 21.912% | 21.318% | 21.734% | 18.824% | 16.211% | 20.000% |
## | 21.912% | 21.318% | 21.734% | 18.824% | 16.211% | |
## | 4.382% | 4.264% | 4.347% | 3.765% | 3.242% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 252 | 320 | 377 | 371 | 364 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 21.351 | 0.838 | 4.798 | 3.473 | 2.197 | |
## | 14.964% | 19.002% | 22.387% | 22.031% | 21.615% | 20.000% |
## | 14.964% | 19.002% | 22.387% | 22.031% | 21.615% | |
## | 2.993% | 3.800% | 4.477% | 4.406% | 4.323% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 93 | 171 | 291 | 489 | 640 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 176.480 | 81.620 | 6.228 | 68.779 | 272.952 | |
## | 5.523% | 10.154% | 17.280% | 29.038% | 38.005% | 20.000% |
## | 5.523% | 10.154% | 17.280% | 29.038% | 38.005% | |
## | 1.105% | 2.031% | 3.456% | 5.808% | 7.601% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1020.148 d.f. = 16 p = 5.427126e-207
##
##
##
## Minimum expected frequency: 336.8
We can confidently say that need and Residential population density change from 1980 - 2010 are associated due to the smol p-value (p = 8.912258e-179).
#First create a new column that breaks res_pop10_dens into six classes
cwns_int <- cwns_int %>%
mutate(res_pop_dens_chng_80_10 = ntile(cwns2012$res_pop_dens_chng_80_10, 5))
#cross tab of res_pop10_dens and need
CrossTable(cwns_int$need_int, cwns_int$res_pop_dens_chng_80_10, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | cwns_int$res_pop_dens_chng_80_10
## cwns_int$need_int | 1 | 2 | 3 | 4 | 5 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 358 | 484 | 374 | 296 | 172 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 1.334 | 64.334 | 4.109 | 4.943 | 80.638 | |
## | 21.259% | 28.741% | 22.209% | 17.577% | 10.214% | 20.000% |
## | 21.259% | 28.741% | 22.209% | 17.577% | 10.214% | |
## | 4.252% | 5.748% | 4.442% | 3.515% | 2.043% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 379 | 414 | 393 | 297 | 201 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 5.288 | 17.695 | 9.378 | 4.703 | 54.755 | |
## | 22.506% | 24.584% | 23.337% | 17.637% | 11.936% | 20.000% |
## | 22.506% | 24.584% | 23.337% | 17.637% | 11.936% | |
## | 4.501% | 4.917% | 4.667% | 3.527% | 2.387% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 372 | 374 | 362 | 333 | 243 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 3.679 | 4.109 | 1.886 | 0.043 | 26.124 | |
## | 22.090% | 22.209% | 21.496% | 19.774% | 14.430% | 20.000% |
## | 22.090% | 22.209% | 21.496% | 19.774% | 14.430% | |
## | 4.418% | 4.442% | 4.299% | 3.955% | 2.886% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 339 | 290 | 337 | 336 | 382 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.014 | 6.503 | 0.000 | 0.002 | 6.066 | |
## | 20.131% | 17.221% | 20.012% | 19.952% | 22.684% | 20.000% |
## | 20.131% | 17.221% | 20.012% | 19.952% | 22.684% | |
## | 4.026% | 3.444% | 4.002% | 3.990% | 4.537% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 236 | 122 | 218 | 422 | 686 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 30.168 | 136.992 | 41.905 | 21.553 | 362.057 | |
## | 14.014% | 7.245% | 12.945% | 25.059% | 40.736% | 20.000% |
## | 14.014% | 7.245% | 12.945% | 25.059% | 40.736% | |
## | 2.803% | 1.449% | 2.589% | 5.012% | 8.147% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 888.2779 d.f. = 16 p = 8.912258e-179
##
##
##
## Minimum expected frequency: 336.8
We can confidently say that need and 2010 population size are associated due to the very small p-value (p = 5.214784e-217).
#First create a new column that breaks res_pop10 into six classes
cwns_int <- cwns_int %>%
mutate(POP10 = ntile(cwns2012$POP10, 5))
#cross tab of res_pop10_dens and need
CrossTable(cwns_int$need_int, cwns_int$POP10, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | cwns_int$POP10
## cwns_int$need_int | 1 | 2 | 3 | 4 | 5 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 557 | 402 | 290 | 245 | 190 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 143.967 | 12.622 | 6.503 | 25.021 | 63.985 | |
## | 33.076% | 23.872% | 17.221% | 14.549% | 11.283% | 20.000% |
## | 33.076% | 23.872% | 17.221% | 14.549% | 11.283% | |
## | 6.615% | 4.774% | 3.444% | 2.910% | 2.257% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 430 | 390 | 371 | 295 | 198 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 25.790 | 8.403 | 3.473 | 5.188 | 57.201 | |
## | 25.534% | 23.159% | 22.031% | 17.518% | 11.758% | 20.000% |
## | 25.534% | 23.159% | 22.031% | 17.518% | 11.758% | |
## | 5.107% | 4.632% | 4.406% | 3.504% | 2.352% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 363 | 385 | 363 | 305 | 268 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 2.038 | 6.898 | 2.038 | 3.002 | 14.054 | |
## | 21.556% | 22.862% | 21.556% | 18.112% | 15.914% | 20.000% |
## | 21.556% | 22.862% | 21.556% | 18.112% | 15.914% | |
## | 4.311% | 4.572% | 4.311% | 3.622% | 3.183% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 256 | 323 | 375 | 373 | 357 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 19.384 | 0.565 | 4.333 | 3.891 | 1.212 | |
## | 15.202% | 19.181% | 22.268% | 22.150% | 21.200% | 20.000% |
## | 15.202% | 19.181% | 22.268% | 22.150% | 21.200% | |
## | 3.040% | 3.836% | 4.454% | 4.430% | 4.240% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 78 | 184 | 285 | 466 | 671 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 198.864 | 69.323 | 7.967 | 49.562 | 331.620 | |
## | 4.632% | 10.926% | 16.924% | 27.672% | 39.846% | 20.000% |
## | 4.632% | 10.926% | 16.924% | 27.672% | 39.846% | |
## | 0.926% | 2.185% | 3.385% | 5.534% | 7.969% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 1066.906 d.f. = 16 p = 5.214784e-217
##
##
##
## Minimum expected frequency: 336.8
Interestingly, the p-value for need and Projected change in residential receiving treatment is exactly 0. This could mean they are very closely associated and warrants further testing.
#First create a new column that breaks res_pop10_dens into six classes
cwns_int <- cwns_int %>%
mutate(proj_change_res_rec_trmt = ntile(cwns2012$proj_change_res_rec_trmt, 5))
#cross tab of res_pop10_dens and need
CrossTable(cwns_int$need_int, cwns_int$proj_change_res_rec_trmt, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | cwns_int$proj_change_res_rec_trmt
## cwns_int$need_int | 1 | 2 | 3 | 4 | 5 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 462 | 555 | 372 | 207 | 88 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 46.541 | 141.364 | 3.679 | 50.024 | 183.793 | |
## | 27.435% | 32.957% | 22.090% | 12.292% | 5.226% | 20.000% |
## | 27.435% | 32.957% | 22.090% | 12.292% | 5.226% | |
## | 5.487% | 6.591% | 4.418% | 2.458% | 1.045% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 392 | 528 | 366 | 276 | 122 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 9.047 | 108.543 | 2.532 | 10.976 | 136.992 | |
## | 23.278% | 31.354% | 21.734% | 16.390% | 7.245% | 20.000% |
## | 23.278% | 31.354% | 21.734% | 16.390% | 7.245% | |
## | 4.656% | 6.271% | 4.347% | 3.278% | 1.449% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 347 | 347 | 455 | 349 | 186 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.309 | 0.309 | 41.482 | 0.442 | 67.520 | |
## | 20.606% | 20.606% | 27.019% | 20.724% | 11.045% | 20.000% |
## | 20.606% | 20.606% | 27.019% | 20.724% | 11.045% | |
## | 4.121% | 4.121% | 5.404% | 4.145% | 2.209% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 274 | 188 | 365 | 502 | 355 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 11.710 | 65.741 | 2.361 | 81.030 | 0.983 | |
## | 16.271% | 11.164% | 21.675% | 29.810% | 21.081% | 20.000% |
## | 16.271% | 11.164% | 21.675% | 29.810% | 21.081% | |
## | 3.254% | 2.233% | 4.335% | 5.962% | 4.216% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 209 | 66 | 126 | 350 | 933 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 48.494 | 217.733 | 131.938 | 0.517 | 1055.387 | |
## | 12.411% | 3.919% | 7.482% | 20.784% | 55.404% | 20.000% |
## | 12.411% | 3.919% | 7.482% | 20.784% | 55.404% | |
## | 2.482% | 0.784% | 1.496% | 4.157% | 11.081% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 2419.448 d.f. = 16 p = 0
##
##
##
## Minimum expected frequency: 336.8
The crosstable between need and projected change in residential receiving collection is also exactly 0, which suggests projected change in residential receiving collection and treatment are closely related to eachother.
#First create a new column that breaks the variable into six classes
cwns_int <- cwns_int %>%
mutate(proj_change_res_rec_collctn = ntile(cwns2012$proj_change_res_rec_collctn, 5))
#cross tab of var5 and need
CrossTable(cwns_int$need_int, cwns_int$proj_change_res_rec_collctn, fisher = FALSE, chisq = TRUE,
expected = TRUE, sresid = FALSE, format="SPSS")##
## Cell Contents
## |-------------------------|
## | Count |
## | Expected Values |
## | Chi-square contribution |
## | Row Percent |
## | Column Percent |
## | Total Percent |
## |-------------------------|
##
## Total Observations in Table: 8420
##
## | cwns_int$proj_change_res_rec_collctn
## cwns_int$need_int | 1 | 2 | 3 | 4 | 5 | Row Total |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 1 | 451 | 533 | 376 | 232 | 92 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 38.722 | 114.295 | 4.562 | 32.610 | 177.931 | |
## | 26.781% | 31.651% | 22.328% | 13.777% | 5.463% | 20.000% |
## | 26.781% | 31.651% | 22.328% | 13.777% | 5.463% | |
## | 5.356% | 6.330% | 4.466% | 2.755% | 1.093% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 2 | 385 | 504 | 390 | 276 | 129 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 6.898 | 83.004 | 8.403 | 10.976 | 128.209 | |
## | 22.862% | 29.929% | 23.159% | 16.390% | 7.660% | 20.000% |
## | 22.862% | 29.929% | 23.159% | 16.390% | 7.660% | |
## | 4.572% | 5.986% | 4.632% | 3.278% | 1.532% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 3 | 339 | 338 | 443 | 372 | 192 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 0.014 | 0.004 | 33.487 | 3.679 | 62.254 | |
## | 20.131% | 20.071% | 26.306% | 22.090% | 11.401% | 20.000% |
## | 20.131% | 20.071% | 26.306% | 22.090% | 11.401% | |
## | 4.026% | 4.014% | 5.261% | 4.418% | 2.280% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 4 | 273 | 205 | 349 | 488 | 369 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 12.086 | 51.577 | 0.442 | 67.878 | 3.079 | |
## | 16.211% | 12.173% | 20.724% | 28.979% | 21.912% | 20.000% |
## | 16.211% | 12.173% | 20.724% | 28.979% | 21.912% | |
## | 3.242% | 2.435% | 4.145% | 5.796% | 4.382% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 5 | 236 | 104 | 126 | 316 | 902 | 1684 |
## | 336.800 | 336.800 | 336.800 | 336.800 | 336.800 | |
## | 30.168 | 160.914 | 131.938 | 1.285 | 948.489 | |
## | 14.014% | 6.176% | 7.482% | 18.765% | 53.563% | 20.000% |
## | 14.014% | 6.176% | 7.482% | 18.765% | 53.563% | |
## | 2.803% | 1.235% | 1.496% | 3.753% | 10.713% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 1684 | 1684 | 1684 | 1684 | 1684 | 8420 |
## | 20.000% | 20.000% | 20.000% | 20.000% | 20.000% | |
## ------------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 2112.904 d.f. = 16 p = 0
##
##
##
## Minimum expected frequency: 336.8
Correlation tests were run for all the resilience measures, as well as for 2010 population, 2010 population density, 2009 median income, and present residential receiving treatment. Values with insignificant correlation with total official need were commented out of the code.
A number of variables exhibited no significant correlation with total official need at all. These were:
#cor(cwns2012$medinc_prcnt_chng_69_09, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")
#cor(cwns2012$res_pop_dens_prcnt_chng_80_10, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")
#cor(cwns2012$proj_prcntchng_res_rec_trmt, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")
#cor(cwns2012$proj_prcntchng_res_rec_collctn, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")
#cor(cwns2012$proj_prcntchng_n_res_rec_trmt, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")
#cor(cwns2012$proj_prcntchng_n_res_rec_collctn, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")The following 12 variables exhibited measurable correlation with need (in ascending order): * medinc09 * PCTWHITE10 * proj_chng_n_res_rec_trmt * proj_chng_n_res_rec_collctn * POP10 * CSS * proj_change_res_rec_collctn * proj_change_res_rec_trmt * res_pop_dens_chng_80_10 * res_pop10_dens * PRES_RES_REC_COLLCTN * PRES_RES_REC_TRMT
ggmedinc <- ggplot(cwns2012, aes(x=log(MEDINC09), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Median Income by Official Need") +
xlab("Log-Adjusted Median Income (2009)") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5) +
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
ggwhite <- ggplot(cwns2012, aes(x=log(PCTWHITE10), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("% White by Official Need") +
xlab("Log-Adjusted % White (2010)") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5) +
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
grid.arrange(ggmedinc, ggwhite, ncol = 2)Median income seems to have an extremely weak correlation with need based on its correlation coefficient (0.052) and clustered correlation pattern.
cor(cwns2012$MEDINC09, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.05235008
Percent white has substantially higher (but negative) correlation with need than median income (corr = -0.150) although the correlation plot above shows that the data skews toward a larger percent white.
cor(cwns2012$PCTWHITE10, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] -0.1501574
ggnrrt <- ggplot(cwns2012, aes(x=log(proj_chng_n_res_rec_trmt), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Projected Change in Non-Res. Pop Receiving\nTreatment by Official Need") +
xlab("Log-Adjusted Projected Change in Non-Res Pop\nReceiving Treatment") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
ggnrrc <- ggplot(cwns2012, aes(x=log(proj_chng_n_res_rec_collctn), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Projected Change in Non-Res Pop Receiving\nCollection by Official Need") +
xlab("Log-Adjusted Projected Change in Non-Res Pop\nReceiving Collection") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
grid.arrange(ggnrrt, ggnrrc, ncol = 2)Projected change in non-residential population receiving treatment has a slight correlation with need (corr = 0.162) but we see the data is still widely distributed.
cor(cwns2012$proj_chng_n_res_rec_trmt, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.1620348
Projected change in non-residential population receiving collection has nearly the same slight correlation (corr = 0.163).
cor(cwns2012$proj_chng_n_res_rec_collctn, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.1633675
ggres10 <- ggplot(cwns2012, aes(x=log(POP10), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Residential Population (2010) by Official Need") +
xlab("Log-Adjusted Residential Population (2010)") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
ggcss <- ggplot(cwns2012, aes(x=CSS, y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("CSS Presence by Official Need") +
xlab("CSS Presence") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
grid.arrange(ggres10, ggcss, ncol = 2)2010 population is in the same category of slight correlation (corr = 0.175) and wide distribution of points.
cor(cwns2012$POP10, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.1754901
Presence of CSS has a more notable correlation with need, although it is still relatively small (corr = 0.204).
cor(cwns2012$CSS, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.204145
ggrrt <- ggplot(cwns2012, aes(x=log(proj_change_res_rec_trmt), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Projected Change in Res Pop Receiving\nTreatment by Official Need") +
xlab("Log-Adjusted Projected Change in Res Pop\nReceiving Treatment") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
ggrrc <- ggplot(cwns2012, aes(x=log(proj_change_res_rec_collctn), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Projected Change in Res Pop Receiving\nCollection by Official Need") +
xlab("Log-Adjusted Projected Change in Res Pop\nReceiving Collection") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
grid.arrange(ggrrt, ggrrc, ncol = 2)Projected change in residential population receiving collection improves on CSS (corr = 0.217).
cor(cwns2012$proj_change_res_rec_collctn, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.217166
Projected change in residential population receiving treatment has a very similar notable but moderate correlation (corr = 0.226).
cor(cwns2012$proj_change_res_rec_trmt, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.226317
ggpdc <- ggplot(cwns2012, aes(x=log(res_pop_dens_chng_80_10), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Residential Pop Density Change by Official Need") +
xlab("Log-Adjusted Residential Pop Density Change (1980 - 2010)") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5) +
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
ggpd <- ggplot(cwns2012, aes(x=log(res_pop10_dens), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Residential Pop Density (2010) by Official Need") +
xlab("Log-Adjusted Residential Pop Density in 2010") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
grid.arrange(ggpdc, ggpd, ncol = 2)Change in residential population density from 1980 to 2010 has a moderate correlation with need that is notably higher than the above variables (corr = 0.299). However the clustering of data makes it seem like both these variables are less reliable than the previous two.
cor(cwns2012$res_pop_dens_chng_80_10, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.2998528
Although residential population density in 2010 still exhibits a moderate correlation with need overall, it’s correlation coefficient is significantly higher than the rest of the variables we tested (corr = 0.422).
cor(cwns2012$res_pop10_dens, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.4227108
ggprrc <- ggplot(cwns2012, aes(x=log(PRES_RES_REC_COLLCTN), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Present Res Pop Receiving\nCollection by Official Need") +
xlab("Log-Adjusted Present Res Pop Receiving Collection") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
ggprrt <- ggplot(cwns2012, aes(x=log(PRES_RES_REC_TRMT), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Present Res Pop Receiving\nTreatment by Official Need") +
xlab("Log-Adjusted Present Res Pop Receiving Treatment") + ylab("Log-Adjusted Official Need") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()
grid.arrange(ggprrc, ggprrt, ncol = 2)Present residential receiving collection also a correlation with need that is comparitively very high (corr = 0.529). The plot bears this out as well, with relatively tight clustering.
cor(cwns2012$PRES_RES_REC_COLLCTN, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.5294731
Present residential receiving treatment has the highest overall correlation with need (corr = 0.531).
cor(cwns2012$PRES_RES_REC_TRMT, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.5310858
The second phase of this project was to better understand the relationships between the variables we started in Part I and need by running bivariate and multivariate linear regression models.
In order to model infrastructure need, we needed variables that represented it. We selected the following variables to include in our models based on their association and correlation with need as well as their policy relevance:
To better understand Philadelphia’s especially high infrastructure need, we created the following additional variables:
cwns2012 <- cwns2012 %>%
mutate(ResBurden = TOTAL_OFFICIAL_NEED/max(PROJ_RES_REC_COLLCTN, PROJ_RES_REC_TRMT)) %>%
mutate(LogResBurden = ifelse(ResBurden > 0, log(ResBurden), 0)) %>%
mutate(LogPOP10 = ifelse(POP10 > 0, log(POP10), 0)) %>%
mutate(LogMEDINC09 = ifelse(MEDINC09 > 0, log(MEDINC09), 0))Residential Burden
ggresb <- cwns2012 %>%
ggplot(aes(x=factor(0), y=ResBurden)) +
ggtitle("Residential Burden") +
xlab("") +ylab("") +
geom_boxplot(fill="#6496b4", size=.35, color="#0059c7", alpha=0.8) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
gglogresb <- cwns2012 %>%
ggplot(aes(x=factor(0), y=LogResBurden)) +
ggtitle("Log of Residential Burden") +
xlab("") +ylab("") +
geom_boxplot(fill="#6496b4", size=.35, color="#0059c7", alpha=0.8) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
grid.arrange(ggresb, gglogresb, ncol = 2)
We created a ‘ResBurden’ variable to understand if Philadelphia’s high
need could be driven by an overstretched system designed to serve more
people than it currently does. We calculated residential burden as the
ratio of total need to residential population served. If this variable
were to be significant, its coefficient would tell us the expected
increase (or decrease) in total need based on the ratio of need to
residential pop served.
The box plot of ‘ResBurden’ shows some significant outliers, but the value for most facilities is near zero. We then created a log transformed ‘LogResBurden’ variable to deal with its extreme skew and get a more normal distribution. This shows us a log transformed mean below zero, meaning that the average facility need in dollars is lower than the amount of residences it serves.
LogPOP10
ggplogpop <- cwns2012%>%
ggplot(aes(x=factor(0), y=LogPOP10)) +
ggtitle("Log of 2010 Population") +
xlab("") +ylab("") +
geom_boxplot(fill="#6496b4", size=.35, color="#0059c7", alpha=0.8) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggppop <- cwns2012%>%
ggplot(aes(x=factor(0), y=POP10)) +
ggtitle("2010 Population") +
xlab("") +ylab("") +
geom_boxplot(fill="#6496b4", size=.35, color="#0059c7", alpha=0.8) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
grid.arrange(ggppop, ggplogpop, ncol = 2)
Because Philadelphia is one of the ten largest cities in the U.S. by
population, ‘POP10’ is an important variable for our analysis. A large
regression coefficient and high significance would indicate that
increases in population have an important affect on the need of the
facilities that serve them. Because our population data is also
significantly skewed by large outliers, we log transformed it as well by
creating a ‘LogPOP10’ variable.
The box plot of ’LogPOP10’also shows some significant outliers, but the mean is well above zero. In fact, the mean population per facility in this data set is 304,727 which is notably lower than the 508,668 persons served by each of Philadelphia’s three facilities.
LogMEDINC09
ggloginc <- cwns2012%>%
ggplot(aes(x=factor(0), y=LogMEDINC09)) +
ggtitle("Log of 2009 Median Income") +
xlab("") +ylab("") +
geom_boxplot(fill="#6496b4", size=.35, color="#0059c7", alpha=0.8) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
gginc <- cwns2012%>%
ggplot(aes(x=factor(0), y=MEDINC09)) +
ggtitle("2009 Median Income") +
xlab("") +ylab("") +
geom_boxplot(fill="#6496b4", size=.35, color="#0059c7", alpha=0.8) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
grid.arrange(gginc, ggloginc, ncol = 2)Philadelphia is also one of the poorest large cities in the U.S. Therefore, income is an important variable to compare with need. A large and negative regression coefficient and high significance would indicate that increases in income have an important affect on need. Because our income data skews toward the top end of the distribution, we create a log transformed ‘LogMEDINC09’ variable to have a more normal distribution to include in our regression models.
## [1] 22053 54728 75415 129326
## [1] 494 585375 2504700 5194675 9818605
Suspecting that some factors influencing waste water infrastructure need in Philadelphia may be categorical rather than continuous, we created the three new binary dummy variables to include in our models. These are ‘GrowingCity’, ‘lowincome’, and ‘lrgcity’.
‘GrowingCity’ is meant to reflect the particular case of Philadelphia, which has experienced significant population growth since 1980.
Philadelphia is also a ‘lrgcity’, one of the 10 largest in the country, so we hypothesize being in this class of cities could also affect total need.
Finally, ‘lowincome’ is also relevant to Philadelphia because it is a relatively low-income city, so we hypothesize this might relate to its high infrastructure need as well.
Finally, we’ve created a list of 18 variables we think would be good candidates to run our models on. Here’s the list:
However, some of these variables might be too correlated with eachother or with total need to give us accurate predictions if they were to be in the same model. So, we ran pairwise correlations between variables (visualized with the correlogram below). This revealed that some of the variables we are interested in testing are highly correlated.
First, we examined the correlation coefficients between our candidate variables and total official need:
#make matrix of variables we're testing
mod_vars <- cwns2012 %>%
dplyr::select(TOTAL_OFFICIAL_NEED, log_TOTAL_NEED, res_pop10_dens, res_pop_dens_chng_80_10,
proj_change_res_rec_trmt, proj_change_res_rec_collctn, POP10, MEDINC09,
ResBurden, LogResBurden, PRES_RES_REC_TRMT, PRES_RES_REC_COLLCTN, LogPOP10, LogMEDINC09,
PCTWHITE10, CSS, GrowingCity, lowincome, lrgcity)
#compute correlation matrix
cormatrix <- cor(mod_vars) %>%
round(., 2)
#plot a correlogram
corrplot(cormatrix, method = "circle", type = "lower", order = "hclust",
tl.col = "black", tl.srt = 45)Specifically, the following variables exhibited high degrees of significance with each other:
We then looked at all the correlation scores between each of our candidate variables and official need to help us select further:
print('res_pop10_dens')## [1] "res_pop10_dens"
cor(cwns2012$res_pop10_dens, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.4227108
print('res_pop_dens_chng_80_10')## [1] "res_pop_dens_chng_80_10"
cor(cwns2012$res_pop_dens_chng_80_10, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.2998528
print('proj_change_res_rec_trmt')## [1] "proj_change_res_rec_trmt"
cor(cwns2012$proj_change_res_rec_trmt, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.226317
print('proj_change_res_rec_collctn')## [1] "proj_change_res_rec_collctn"
cor(cwns2012$proj_change_res_rec_collctn, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.217166
print('POP10')## [1] "POP10"
cor(cwns2012$POP10, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.1754901
print('LogPOP10')## [1] "LogPOP10"
cor(cwns2012$LogPOP10, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.1845537
print('PRES_RES_REC_TRMT')## [1] "PRES_RES_REC_TRMT"
cor(cwns2012$PRES_RES_REC_TRMT, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.5310858
print('PCTWHITE10')## [1] "PCTWHITE10"
cor(cwns2012$PCTWHITE10, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] -0.1501574
print('ResBurden')## [1] "ResBurden"
cor(cwns2012$ResBurden, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 1
print('LogResBurden')## [1] "LogResBurden"
cor(cwns2012$LogResBurden, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.3944091
print('LogOfficialNeed')## [1] "LogOfficialNeed"
cor(cwns2012$log_TOTAL_NEED, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.2819446
print('MEDINC09')## [1] "MEDINC09"
cor(cwns2012$MEDINC09, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.05235008
print('LogMEDINC09')## [1] "LogMEDINC09"
cor(cwns2012$LogMEDINC09, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.05545658
print('CSS')## [1] "CSS"
cor(cwns2012$CSS, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.204145
print('Growing City')## [1] "Growing City"
cor(cwns2012$GrowingCity, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] 0.04273358
print('Low Income')## [1] "Low Income"
cor(cwns2012$lowincome, cwns2012$TOTAL_OFFICIAL_NEED, use="complete.obs", method="pearson")## [1] -0.04949486
We found that ResBurden has a correlation of 1 with Total Official Need, meaning Total Official Need is a function of ResBurden. Therefore, we exclude it from our regression analysis moving forward.
Based on correlation with Total Official Need, and elimination of one of a variable pair when colinearity is present, favoring the one with higher correlation with need, we narrowed down this list of variables to 7 independent variables.
Final variable list to test: * PRES_RES_REC_TRMT (cor with need = 0.53) * res_pop10_dens (cor with need = 0.42) * CSS (cor with need = 0.20) * LogPOP10 (cor with need = 0.18) * PCTWHITE10 (cor with need = -0.15) * LogMEDINC09 (cor with need = 0.05) * GrowingCity (cor with need = .04)
Second variable list to test: * PRES_RES_REC_TRMT (cor with need = 0.53) * res_pop10_dens (cor with need = 0.42) * CSS (cor with need = 0.20) * LogPOP10 (cor with need = 0.18) * PCTWHITE10 (cor with need = -0.15) * lowincome (cor with need = -0.05)
We then ran bivariate regressions to further clarify the relationships between ‘TOTAL_OFFICIAL_NEED’ and each of the variables considered above.
ggplot(cwns2012, aes(x=log(PRES_RES_REC_TRMT), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Present Residential Receiving Treatment by Official Need") +
xlab("Present Residential Receiving Treatment (Log-Adjusted)") + ylab("Official Need (Log-Adjusted)") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()Need vs PRES_RES_REC_TRMT appears to be significant and its coefficient suggests a one unit increase residences recieving treatment leads to a $670 increase in need.
bimod1 <- lm(TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT, data = cwns2012)
summary(bimod1)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.981e+09 -7.600e+06 -5.839e+06 -3.348e+06 4.666e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.109e+06 1.271e+06 4.808 1.55e-06 ***
## PRES_RES_REC_TRMT 6.702e+02 1.165e+01 57.507 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113900000 on 8418 degrees of freedom
## Multiple R-squared: 0.2821, Adjusted R-squared: 0.282
## F-statistic: 3307 on 1 and 8418 DF, p-value: < 2.2e-16
ggplot(cwns2012, aes(x=log(res_pop10_dens), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Residential Population Density by Official Need") +
xlab("2010 Residential Population Density (Log-Adjusted)") + ylab("Official Need (Log-Adjusted)") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()Need vs. res_pop10_dens also appears to be significant and to have a larger effect, per unit increase, on need.
bimod2 <- lm(TOTAL_OFFICIAL_NEED ~ res_pop10_dens, data = cwns2012)
summary(bimod2)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ res_pop10_dens, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.604e+09 -1.077e+07 -4.791e+06 -1.807e+06 4.641e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3963635 1390596 2.85 0.00438 **
## res_pop10_dens 106509 2489 42.80 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 121900000 on 8418 degrees of freedom
## Multiple R-squared: 0.1787, Adjusted R-squared: 0.1786
## F-statistic: 1831 on 1 and 8418 DF, p-value: < 2.2e-16
ggplot(cwns2012, aes(x=CSS, y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Combined Sewer System by Official Need") +
xlab("Combined Sewer System (CSS)") + ylab("Official Need (Log-Adjusted)") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()Not understanding why but the presence of CSS is associated with higher need, based on the intercept, and is statistically significant. However, this it is consistent with our earlier findings.
bimod3 <- lm(TOTAL_OFFICIAL_NEED ~ CSS, data = cwns2012)
summary(bimod3)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ CSS, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -113053301 -12918966 -11507847 -6999299 5047055225
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13384564 1497686 8.937 <2e-16 ***
## CSS 99668737 5209214 19.133 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 131600000 on 8418 degrees of freedom
## Multiple R-squared: 0.04168, Adjusted R-squared: 0.04156
## F-statistic: 366.1 on 1 and 8418 DF, p-value: < 2.2e-16
ggplot(cwns2012, aes(x=LogPOP10, y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Population 2010 by Official Need") +
xlab("2010 Population (Log-Adjusted)") + ylab("Official Need (Log-Adjusted)") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()Need vs. LogPOP10 also appears to be significant and to have a large effect, per unit increase, on need.
bimod4 <- lm(TOTAL_OFFICIAL_NEED ~ LogPOP10, data = cwns2012)
summary(bimod4)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ LogPOP10, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93823689 -28786621 -10297145 4815460 5087721673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -155142772 10360561 -14.97 <2e-16 ***
## LogPOP10 15565206 903447 17.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132100000 on 8418 degrees of freedom
## Multiple R-squared: 0.03406, Adjusted R-squared: 0.03395
## F-statistic: 296.8 on 1 and 8418 DF, p-value: < 2.2e-16
ggplot(cwns2012, aes(x=log(PCTWHITE10), y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Percent White 2010 by Official Need") +
xlab("2010 Percent White (Log-Adjusted)") + ylab("Official Need (Log-Adjusted)") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()PCTWHITE10 vs. Need shows us there is a large negative relationship between a 1% increase in white population and need.
bimod5 <- lm(TOTAL_OFFICIAL_NEED ~ PCTWHITE10, data = cwns2012)
summary(bimod5)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PCTWHITE10, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111964571 -23211575 -8028143 -837409 5081602461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 131280054 8001470 16.41 <2e-16 ***
## PCTWHITE10 -1329320 95395 -13.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132900000 on 8418 degrees of freedom
## Multiple R-squared: 0.02255, Adjusted R-squared: 0.02243
## F-statistic: 194.2 on 1 and 8418 DF, p-value: < 2.2e-16
ggplot(cwns2012, aes(x=LogMEDINC09, y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Median Income 2009 by Official Need") +
xlab("2009 Median Income (Log-Adjusted)") + ylab("Official Need (Log-Adjusted)") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()LogMEDINC09 vs. need shows this variable to also be significant and also be associated with a large increase in need per unit increase in log adjusted income.
bimod6 <- lm(TOTAL_OFFICIAL_NEED ~ LogMEDINC09, data = cwns2012)
summary(bimod6)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ LogMEDINC09, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45510168 -21916836 -16396530 -9616514 5136067175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -306662575 64437347 -4.759 1.98e-06 ***
## LogMEDINC09 29965961 5880332 5.096 3.55e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134300000 on 8418 degrees of freedom
## Multiple R-squared: 0.003075, Adjusted R-squared: 0.002957
## F-statistic: 25.97 on 1 and 8418 DF, p-value: 3.545e-07
ggplot(cwns2012, aes(x=GrowingCity, y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Growing City by Official Need") +
xlab("Growing City") + ylab("Official Need (Log-Adjusted)") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()GrowingCity vs. need shows that this variable is also significant and growing cities are associated with higher need (based on the intercept).
bimod7 <- lm(TOTAL_OFFICIAL_NEED ~ GrowingCity, data = cwns2012)
summary(bimod7)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ GrowingCity, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25228553 -23889301 -18307143 -10089971 5134879973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12468172 2754174 4.527 6.06e-06 ***
## GrowingCity 12760382 3251569 3.924 8.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134300000 on 8418 degrees of freedom
## Multiple R-squared: 0.001826, Adjusted R-squared: 0.001708
## F-statistic: 15.4 on 1 and 8418 DF, p-value: 8.765e-05
ggplot(cwns2012, aes(x=lowincome, y=log(TOTAL_OFFICIAL_NEED))) +
ggtitle("Low Income City by Official Need") +
xlab("Low Income") + ylab("Official Need (Log-Adjusted)") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5)+
geom_smooth(method=lm, color="#0059c7") +
theme_bw()lowincome vs. need shows that this variable is also significant and low-income cities are associated with higher need (based on the intercept).
bimod8 <- lm(TOTAL_OFFICIAL_NEED ~ lowincome, data = cwns2012)
summary(bimod8)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ lowincome, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27577432 -25159092 -13888102 -10430813 5132531094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27577432 1963883 14.042 < 2e-16 ***
## lowincome -13390627 2945125 -4.547 5.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134300000 on 8418 degrees of freedom
## Multiple R-squared: 0.00245, Adjusted R-squared: 0.002331
## F-statistic: 20.67 on 1 and 8418 DF, p-value: 5.525e-06
Finally, we run a series of multivariate regressions to measure the degree to which the independent variables we hypothesized to be significant are linearly related to need.
For our first model, we begin with a selection of all our hypothesized significant variables that do not exhibit multi-collinearity according to our correlation tests:
#Interpreting regression results - key statistics:
#R-squared
# Tells what % of the variability in the dependent variable is explained by the model
# in other words - r-squared measures the strength of the relationship btwn the model
# and dependent variable on a 0-100% scale
# so, higher r-squared values represent smaller differences btwn the observed data
# and the fitted values.
# r-squared < 50% is common when studying human behavior
# IMPORTANT NOTE: r-squared increases every time you add an independent variable to the model.
#Adjusted R-squared
# Adjusts for the number of terms in the model!
# Its value increases only when the new term improves the model fit more than expected by chance alone.
# The adjusted R-squared value actually decreases when the term doesn’t improve the model fit enough.
# Useful for comparing goodness-of-fit for regression models with differing numbers of independent variables.
#Multiple R-squared
#Significance codes
#F-statistic
# a large f-stat means the model has a strong fit and we can reject the
# null hypothesis that a null model (just the intercept) describes the data better.
# If the p-value associated with the F-stat is ≥ 0.05 Then there is no relationship between ANY of the
# independent variables and Y.
# If the p-value associated with the F-statistic < 0.05 then AT LEAST 1 independent variable is related to Y.
mod1 <- lm (TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens + CSS + LogPOP10 +
PCTWHITE10 + LogMEDINC09 + GrowingCity, data = cwns2012)
summary(mod1)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens +
## CSS + LogPOP10 + PCTWHITE10 + LogMEDINC09 + GrowingCity,
## data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.666e+09 -8.758e+06 -1.936e+05 7.050e+06 4.355e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.873e+08 5.975e+07 4.809 1.54e-06 ***
## PRES_RES_REC_TRMT 5.646e+02 1.173e+01 48.157 < 2e-16 ***
## res_pop10_dens 7.691e+04 2.483e+03 30.978 < 2e-16 ***
## CSS 4.756e+07 4.387e+06 10.842 < 2e-16 ***
## LogPOP10 -5.189e+06 1.046e+06 -4.961 7.15e-07 ***
## PCTWHITE10 6.580e+04 8.962e+04 0.734 0.462871
## LogMEDINC09 -2.301e+07 6.190e+06 -3.717 0.000203 ***
## GrowingCity 1.408e+07 2.831e+06 4.975 6.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106400000 on 8412 degrees of freedom
## Multiple R-squared: 0.3737, Adjusted R-squared: 0.3732
## F-statistic: 717.1 on 7 and 8412 DF, p-value: < 2.2e-16
Interpreting the Results of Model 1 The R-squared and adjusted R-squared tell us this first model describes ~37% of the variance in our data. There is room for improvement, but we are capturing some significant relationships. The p-value associated with the model’s F-statistic is low, which tells us the model describes the data better than the mean. Every variable tested appears to be significant except for ‘PCTWHITE10’. Therefore, Model 1 could perform better, but seems like a promising start.
Next, we employ automated backwards selection on model 1 to see which, if any variables we can drop to improve the fit and efficiency of the model. We choose backwards selection, rather than forwards selection, because it allows us to use the entire variable list we developed previously as our starting point. This is preferable to forward selection, which would have required us to arbitrarily pick and then build off of a starting variable.
#AIC
# Particularly helpful for backwards selection
# it's a measure of the quality of a model and used to pick a model from many.
# AIC is all relative - just tells you what's the best one you have.
# the preferred model has the lowest AIC. And AIC increases with the number of explanatory variables.
mod2 <- step(mod1, direction = "backward")## Start: AIC=311264
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens + CSS +
## LogPOP10 + PCTWHITE10 + LogMEDINC09 + GrowingCity
##
## Df Sum of Sq RSS AIC
## - PCTWHITE10 1 6.1070e+15 9.5317e+19 311262
## <none> 9.5311e+19 311264
## - LogMEDINC09 1 1.5654e+17 9.5468e+19 311276
## - LogPOP10 1 2.7886e+17 9.5590e+19 311287
## - GrowingCity 1 2.8040e+17 9.5592e+19 311287
## - CSS 1 1.3318e+18 9.6643e+19 311379
## - res_pop10_dens 1 1.0873e+19 1.0618e+20 312172
## - PRES_RES_REC_TRMT 1 2.6276e+19 1.2159e+20 313312
##
## Step: AIC=311262.5
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens + CSS +
## LogPOP10 + LogMEDINC09 + GrowingCity
##
## Df Sum of Sq RSS AIC
## <none> 9.5317e+19 311262
## - LogMEDINC09 1 1.5477e+17 9.5472e+19 311274
## - GrowingCity 1 2.7689e+17 9.5594e+19 311285
## - LogPOP10 1 3.7357e+17 9.5691e+19 311293
## - CSS 1 1.3740e+18 9.6691e+19 311381
## - res_pop10_dens 1 1.1135e+19 1.0645e+20 312191
## - PRES_RES_REC_TRMT 1 2.6345e+19 1.2166e+20 313315
The results of automated backwards selection tell us we can drop ‘PCTWHITE10’ because the AIC is lower for the model when it is not included, showing that a model without PCTWHITE10 has a better fit with the data. This gives us ‘mod2’.
anova(mod1, mod2)## Analysis of Variance Table
##
## Model 1: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens + CSS +
## LogPOP10 + PCTWHITE10 + LogMEDINC09 + GrowingCity
## Model 2: TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens + CSS +
## LogPOP10 + LogMEDINC09 + GrowingCity
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8412 9.5311e+19
## 2 8413 9.5317e+19 -1 -6.107e+15 0.539 0.4629
Comparing ‘mod1’ and ‘mod2’ with an analysis of Variance Table gives a p-value much larger than the 0.05 threshold for statistically significant difference between the model’s means. Therefore, while ‘mod2’’s AIC suggests it performs slightly better, running ANOVA suggests the models don’t perform differently from eachother in a meaningful way.
summary(mod2)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens +
## CSS + LogPOP10 + LogMEDINC09 + GrowingCity, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.665e+09 -8.961e+06 -1.381e+05 7.108e+06 4.355e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.788e+08 5.860e+07 4.757 1.99e-06 ***
## PRES_RES_REC_TRMT 5.640e+02 1.170e+01 48.221 < 2e-16 ***
## res_pop10_dens 7.659e+04 2.443e+03 31.349 < 2e-16 ***
## CSS 4.795e+07 4.354e+06 11.012 < 2e-16 ***
## LogPOP10 -5.498e+06 9.575e+05 -5.742 9.67e-09 ***
## LogMEDINC09 -2.140e+07 5.791e+06 -3.696 0.00022 ***
## GrowingCity 1.397e+07 2.827e+06 4.944 7.82e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106400000 on 8413 degrees of freedom
## Multiple R-squared: 0.3737, Adjusted R-squared: 0.3732
## F-statistic: 836.6 on 6 and 8413 DF, p-value: < 2.2e-16
Looking at ‘mod2’‘s r-squared values, we see they are in fact the same as ’mod1’. This means at the very least it is not worth keeping PCTWHITE10 in the model.
What if we replace median income with our binary low income variable?
mod3 <- lm (TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens + CSS + LogPOP10 + lowincome + GrowingCity, data = cwns2012)
summary(mod3)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens +
## CSS + LogPOP10 + lowincome + GrowingCity, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.665e+09 -8.609e+06 3.779e+04 6.751e+06 4.359e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.065e+07 1.087e+07 4.662 3.19e-06 ***
## PRES_RES_REC_TRMT 5.659e+02 1.168e+01 48.430 < 2e-16 ***
## res_pop10_dens 7.623e+04 2.443e+03 31.210 < 2e-16 ***
## CSS 4.797e+07 4.357e+06 11.011 < 2e-16 ***
## LogPOP10 -6.220e+06 9.320e+05 -6.674 2.64e-11 ***
## lowincome 6.509e+06 2.724e+06 2.389 0.0169 *
## GrowingCity 1.256e+07 2.784e+06 4.513 6.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106500000 on 8413 degrees of freedom
## Multiple R-squared: 0.3731, Adjusted R-squared: 0.3726
## F-statistic: 834.5 on 6 and 8413 DF, p-value: < 2.2e-16
The r-squared got slightly worse. What does automated backward selection tell us?
#AIC
# Particularly helpful for backwards selection
# it's a measure of the quality of a model and used to pick a model from many.
# AIC is all relative - just tells you what's the best one you have.
# the preferred model has the lowest AIC. And AIC increases with the number of explanatory variables.
step(mod3, direction = "backward")## Start: AIC=311270.4
## TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens + CSS +
## LogPOP10 + lowincome + GrowingCity
##
## Df Sum of Sq RSS AIC
## <none> 9.5407e+19 311270
## - lowincome 1 6.4733e+16 9.5472e+19 311274
## - GrowingCity 1 2.3101e+17 9.5638e+19 311289
## - LogPOP10 1 5.0518e+17 9.5913e+19 311313
## - CSS 1 1.3750e+18 9.6782e+19 311389
## - res_pop10_dens 1 1.1047e+19 1.0645e+20 312191
## - PRES_RES_REC_TRMT 1 2.6599e+19 1.2201e+20 313339
##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens +
## CSS + LogPOP10 + lowincome + GrowingCity, data = cwns2012)
##
## Coefficients:
## (Intercept) PRES_RES_REC_TRMT res_pop10_dens CSS
## 50653385.2 565.9 76232.2 47970410.4
## LogPOP10 lowincome GrowingCity
## -6220473.7 6508827.8 12564463.2
Backward selection tells us the model’s fit is better with low income included than without. But, since ‘lowincome’ and ‘LogMEDINC09’ can’t be in the same model and the r-squared was higher for the model with ‘LogMEDINC09’ we choose to move forward with the original income variable rather than the low income dummy variable.
What if we try interacting terms from model 2 instead?
mod4 <- lm (TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens:CSS + LogPOP10 + LogMEDINC09 + GrowingCity,
data = cwns2012)
summary(mod4)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + res_pop10_dens:CSS +
## LogPOP10 + LogMEDINC09 + GrowingCity, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.635e+09 -7.564e+06 -4.137e+06 7.104e+05 4.281e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.445e+07 5.763e+07 1.465 0.1429
## PRES_RES_REC_TRMT 5.602e+02 1.152e+01 48.628 <2e-16 ***
## LogPOP10 1.392e+06 9.075e+05 1.534 0.1251
## LogMEDINC09 -9.061e+06 5.713e+06 -1.586 0.1127
## GrowingCity 5.271e+06 2.753e+06 1.915 0.0556 .
## res_pop10_dens:CSS 9.526e+04 2.489e+03 38.274 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105100000 on 8414 degrees of freedom
## Multiple R-squared: 0.3899, Adjusted R-squared: 0.3895
## F-statistic: 1075 on 5 and 8414 DF, p-value: < 2.2e-16
Interacting ‘res_pop10_dens’ with ‘css’ improves the fit, but a number of variables lose their significance. What happens if we drop all non-significant terms? In this case, ‘LogPOP10’, ‘LogMEDINC09’, and ‘GrowingCity’.
mod5 <- lm (TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + CSS + res_pop10_dens:CSS,
data = cwns2012)
summary(mod5)##
## Call:
## lm(formula = TOTAL_OFFICIAL_NEED ~ PRES_RES_REC_TRMT + CSS +
## res_pop10_dens:CSS, data = cwns2012)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.627e+09 -4.746e+06 -2.689e+06 -2.985e+05 4.281e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.966e+06 1.211e+06 2.449 0.0144 *
## PRES_RES_REC_TRMT 5.604e+02 1.111e+01 50.460 < 2e-16 ***
## CSS 2.355e+07 4.344e+06 5.421 6.08e-08 ***
## CSS:res_pop10_dens 9.183e+04 2.566e+03 35.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 104900000 on 8416 degrees of freedom
## Multiple R-squared: 0.3915, Adjusted R-squared: 0.3913
## F-statistic: 1805 on 3 and 8416 DF, p-value: < 2.2e-16
Dropping these three variables only reduces the r-squared of ‘mod5’ by 0.0005 and the adjusted r-squared by 0.0003. Also, the p-values associated with the f-statistic of the two models also stays the same, therefore, ‘mod5’ is a leaner and almost as mean model as ‘mod4’ that performs effectively just as well but doesn’t carry the weight of insignificant variables.
In order to select our final model, we create a table of each of the five models to compare their fitness (below). The table shows that models 4 and 5 are our best models. They have the highest r-squared and adjusted r-squared values, the lowest residual standard errors, and the largest F Statistics. We select model 5 as our final model because it achieves equal performance to model 4 with fewer variables and has the highest F Statistic of any of our models.
stargazer(mod1, mod2, mod3, mod4, mod5, type="html", digits = 2, single.row = FALSE)| Dependent variable: | |||||
| TOTAL_OFFICIAL_NEED | |||||
| (1) | (2) | (3) | (4) | (5) | |
| PRES_RES_REC_TRMT | 564.64*** | 564.05*** | 565.89*** | 560.18*** | 560.40*** |
| (11.73) | (11.70) | (11.68) | (11.52) | (11.11) | |
| res_pop10_dens | 76,912.20*** | 76,587.53*** | 76,232.24*** | ||
| (2,482.80) | (2,443.03) | (2,442.53) | |||
| CSS | 47,562,261.00*** | 47,952,953.00*** | 47,970,410.00*** | 23,550,121.00*** | |
| (4,386,951.00) | (4,354,435.00) | (4,356,575.00) | (4,344,062.00) | ||
| LogPOP10 | -5,189,207.00*** | -5,498,256.00*** | -6,220,474.00*** | 1,392,187.00 | |
| (1,045,997.00) | (957,525.80) | (931,999.00) | (907,537.20) | ||
| PCTWHITE10 | 65,795.96 | ||||
| (89,620.63) | |||||
| LogMEDINC09 | -23,007,712.00*** | -21,402,858.00*** | -9,060,779.00 | ||
| (6,189,815.00) | (5,790,812.00) | (5,712,505.00) | |||
| lowincome | 6,508,828.00** | ||||
| (2,724,301.00) | |||||
| GrowingCity | 14,081,263.00*** | 13,974,312.00*** | 12,564,463.00*** | 5,271,457.00* | |
| (2,830,593.00) | (2,826,764.00) | (2,783,825.00) | (2,753,013.00) | ||
| res_pop10_dens:CSS | 95,260.57*** | ||||
| (2,488.94) | |||||
| CSS:res_pop10_dens | 91,829.97*** | ||||
| (2,566.12) | |||||
| Constant | 287,326,251.00*** | 278,793,804.00*** | 50,653,385.00*** | 84,449,347.00 | 2,966,206.00** |
| (59,746,000.00) | (58,603,110.00) | (10,866,273.00) | (57,633,418.00) | (1,211,300.00) | |
| Observations | 8,420 | 8,420 | 8,420 | 8,420 | 8,420 |
| R2 | 0.37 | 0.37 | 0.37 | 0.39 | 0.39 |
| Adjusted R2 | 0.37 | 0.37 | 0.37 | 0.39 | 0.39 |
| Residual Std. Error | 106,444,347.00 (df = 8412) | 106,441,430.00 (df = 8413) | 106,491,691.00 (df = 8413) | 105,050,944.00 (df = 8414) | 104,898,777.00 (df = 8416) |
| F Statistic | 717.12*** (df = 7; 8412) | 836.59*** (df = 6; 8413) | 834.48*** (df = 6; 8413) | 1,075.30*** (df = 5; 8414) | 1,804.85*** (df = 3; 8416) |
| Note: | p<0.1; p<0.05; p<0.01 | ||||
print('VIF')## [1] "VIF"
vif(mod5) # variance inflation factors ## PRES_RES_REC_TRMT CSS CSS:res_pop10_dens
## 1.071149 1.094941 1.142076
print('sqrt(VIF)')## [1] "sqrt(VIF)"
sqrt(vif(mod5))## PRES_RES_REC_TRMT CSS CSS:res_pop10_dens
## 1.034963 1.046394 1.068679
Finally, we check the VIF for Model 5 to ensure we have not selected a final model that has colinear independent variables. This test shows that the model’s variables do not exhibit multi-colinearity because all of their VIF values are below 5 and their square root VIF values are below 2.
After examining the relationship between total budgeted facility need and what feels like countless variables for this assignment, it seems surprising that just three variables describe variation in need better than any other combination we tested. However, when we think about which variables they are and the fact that the data is national, the results of our process make more sense.
The process in question, in which we went from a total of 48 potentially significant independent variables down to 7 to run multivariate regressions on, and eventually down to the 3 most significant, revealed that present residential population receiving treatment, residential population density, and presence of a combined sewer system are key drivers of wastewater facility budgetary need in the United States.
| Independent Vars. | Estimate | Std. Error | t value | Pr(> |
|---|---|---|---|---|
| (Intercept) | 2,966,000 | 1211000 | 2.449 | 0.0144 |
| PRES_RES_REC_TRMT | $560 | 11.1 | 50.460 | < 0.00000000000000002 |
| CSS | $23,550,000 | 4344000 | 5.421 | 0.0000000608 |
| CSS:res_pop10_dens | 91,830 | 2566 | 35.786 | < 0.00000000000000002 |
Specifically, the coefficients of our final model (shown in the table above) predict that:
For every 1 unit increase in residential population receiving treatment, all else equal, we can expect a $565 increase in facility budgetary need. While this value may seem relatively small, it grows quickly. Especially when considering facilities like Philadelphia’s which serve an average of 760,534 people. If an additional 1,000 people move into that district, we predict that facility’s need to increase by $565,000.
Since “CSS” is a binary variable based on the presence or absence of a combined sewer system, and the interaction between CSS and residential population density is highly significant (due to its small p-value), the effect of residential population density on budgeted facility need is only significant when that facility has a CSS.
Because CSS is a binary dummy variable, we interpret its standalone coefficient differently. It tells us that the average increase in need for facilities with a CSS instead of MS4 is a shocking $23,550,000.
It is also important to recognize that the effect of CSS on the model is shaped by its interaction with residential population density. In this model, when combined sewers are not present (CSS = 0) the slope of the regression line equals the coefficient for ‘PRES_RES_REC_TRMT’.
Overall Model The r-squared value of this model is 0.3915, which means that the predictor variables explain about 39% of the variability in the response variable (TOTAL_OFFICIAL_NEED).
#fitted values are predicted values (in this case, need)
mod5_resid <- augment(mod5)
gg5resfull <- ggplot(mod5_resid, aes(x = .fitted, y = .resid)) +
ggtitle("Model 5 Residuals") +
xlab("Fitted") + ylab("Residuals") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5) +
geom_hline(yintercept = 0, linetype = "dashed", color = "#a02300") +
geom_vline(aes(xintercept=mean(TOTAL_OFFICIAL_NEED)), color="#E1B93C") +
geom_vline(aes(xintercept=mean(philly_facilities$TOTAL_OFFICIAL_NEED)), color="#a02300") +
theme_bw()
gg5rescut <- ggplot(mod5_resid, aes(x = .fitted, y = .resid)) +
ggtitle("Model 5 Residuals (Cropped)") +
xlab("Fitted") + ylab("Residuals") +
geom_point(size=1, color="black", fill="#6496b4", shape=21, alpha=.5) +
geom_hline(yintercept = 0, linetype = "dashed", color = "#a02300") +
geom_vline(aes(xintercept=mean(TOTAL_OFFICIAL_NEED)), color="#E1B93C") +
geom_vline(aes(xintercept=mean(philly_facilities$TOTAL_OFFICIAL_NEED)), color="#a02300") +
xlim(0, 200000000) +
theme_bw()
grid.arrange(gg5resfull, gg5rescut, ncol = 2)
Plotting the residuals against the predicted or “fitted” values of our
final model reveals a clustering around lower values, or
heteroscedasticity. Plotting the residuals of an ideal model would show
homoscedasticity - a random yet equal distribution around the regression
line. The heteroscedasticity in our model means it predicts better (is
more accurate) at low values of need than at high ones.
The above plots show mean facility need in the United States with a vertical yellow line and mean facility need for Philadelphia with a red line. The plot on the left includes the full range of data and shows that unfortunately the model does not predict very accurately when need is in the range of Philadelphia’s. However, the zoomed in plot on the right shows that the residuals are much more symmetric around the regression line for values closer to mean U.S. facility need. This suggests that our model is useful for estimating need in cities with average need.
In summary, the results of this model are of significance to cities with large large numbers of present residences receiving treatment, combined sewers, and higher than average population density. Philly fits all of these descriptors. The city’s population density and residences receiving treatment are projected to increase, and it has a CSS. This makes clear that Philly’s need for water infrastructure investment, while currently quite high, will only continue to increase in the future. Therefore, investment in this infrastructure is an absolute MUST!